The author explores the common frustration of running local Large Language Models (LLMs), where the gap between potential and usability is often caused by slow inference speeds. Instead of upgrading to larger, more complex models, the author found that implementing speculative decoding significantly improved the experience. This technique uses a small, fast "draft" model to propose several tokens ahead, which the larger model then verifies; because verifying a batch of proposed tokens costs roughly one forward pass rather than one pass per token, accepted tokens come almost for free. The result is a drastic speedup and a smoother conversational flow without sacrificing the larger model's quality, since any token the draft gets wrong is replaced by the larger model's own prediction. By focusing on how models are run rather than just which models are used, users can make their self-hosted AI tools much more practical for daily use.
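The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration, not the author's implementation: real speculative decoding compares token probabilities from two neural models and verifies the whole draft in one batched forward pass, whereas here both "models" are deterministic stand-in functions so the accept/reject logic is easy to follow.

```python
def draft_model(context):
    # Stand-in for the small, fast draft model: cheap next-token guesses.
    pattern = ["the", "quick", "brown", "fox", "jumps"]
    return pattern[len(context) % len(pattern)]

def target_model(context):
    # Stand-in for the large, slow model whose output we actually trust.
    pattern = ["the", "quick", "brown", "cat", "sleeps"]
    return pattern[len(context) % len(pattern)]

def speculative_decode(context, k=4):
    """Draft k tokens cheaply, then check them against the target model.

    Returns the accepted tokens: the prefix of the draft that the target
    model agrees with, plus the target's own token at the first mismatch.
    """
    # Phase 1: the draft model speculates k tokens ahead.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Phase 2: the target model verifies the draft position by position
    # (in practice this is a single batched forward pass).
    accepted, ctx = [], list(context)
    for tok in draft:
        verified = target_model(ctx)
        if verified == tok:
            accepted.append(tok)       # draft token accepted "for free"
            ctx.append(tok)
        else:
            accepted.append(verified)  # first mismatch: keep target's token
            break
    return accepted

print(speculative_decode([], k=4))
```

With these toy models, the first three draft tokens match the target and are accepted in bulk; the fourth diverges, so the target's token is kept and decoding resumes from there. That is the core trade: when the draft is usually right, you emit several tokens per expensive verification step, and the output is still exactly what the large model would have produced.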
Zed introduces edit prediction powered by Zeta, an open-source model that anticipates a developer's next edit across the buffer, not just at the cursor. Predicted edits can be applied with a single keystroke, and the feature integrates with existing functionality such as language server completions rather than replacing it. The article also covers how Zeta was built: training methods like supervised fine-tuning and direct preference optimization, plus inference-time techniques like speculative decoding to minimize latency and keep the editing experience fast.